Aplication Data¶
Elaborador por: Jacqueline Fernández Ramírez y Jazmín Fernández Ramírez.
Máster Universitario de Ciencia de Datos (MUCD), CUNEF.
El conjunto de datos Bank Account Fraud (BAF) se publicó en NeurIPS 2022 y comprende un total de 6 conjuntos de datos tabulares sintéticos diferentes sobre fraude en cuentas bancarias. BAF es un banco de pruebas realista, completo y sólido para evaluar métodos nuevos y existentes en ML y ML justo
Definición del problema¶
El principal desafío que este proyecto de aprendizaje automático procura resolver es la detección y reducción del peligro de fraude en peticiones de préstamos bancarios. Mediante el uso del conjunto de datos "Bank Account Fraud (BAF)", la finalidad es elaborar un modelo predictivo capaz de diferenciar entre los solicitantes que pueden pagar el préstamo y los que podrían cometer fraude o enfrentar problemas para cumplir con sus compromisos financieros. El reto consiste en identificar patrones y rasgos en la información que sugieran conductas fraudulentas, garantizando simultáneamente que los clientes confiables y financieramente estables no sean desestimados de manera injusta.
Objetivo de la práctica¶
El propósito de este proyecto es emplear métodos de Análisis Exploratorio de Datos (EDA) con el fin de entender y examinar los patrones que se encuentran en el conjunto de datos "Bank Account Fraud (BAF)", publicado en NeurIPS 2022). Este estudio tiene un objetivo concreto: ayudar a un banco en la valoración de peticiones de créditos y en la detección de posibles fraudes. Al emplear información auténtica vinculada a fraudes en cuentas bancarias, se pretende asegurar que los solicitantes con la capacidad de saldar el préstamo no sean rechazados de manera injusta. Al mismo tiempo, se busca detectar patrones que muestren si un cliente podría enfrentar problemas para reembolsar el préstamo, lo que resulta vital para la toma de decisiones anticipadas y la reducción de riesgos. Con el empleo de este EDA, se busca identificar patrones esenciales que faciliten la mejora de los procesos de aprobación de préstamos y disminuyan la prevalencia de fraudes financieros.
Ejecución del modelo: El modelo se va a ejecutar en el momento en el que se solicita un préstamo. Por esta razón, en este caso de estudio no se toman en consideración variables a futuro, ya que todos los datos se rellenarán a la hora de solicitar el préstamo.
Pasos a realizar¶
- Análisis inicial de los datos y preprocesamiento inicial
- Correlaciones, tratamiento de missing y outliers
- Tratamiento de variables categóricas: enconding
- Escalado de las variables
Importación de librerías¶
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)
Revisión del directorio de trabajo¶
import os #paquete de lectura de directorio de trabajo
directorio_actual = os.getcwd()
print("Directorio actual:", directorio_actual)
Directorio actual: C:\Users\jacki\OneDrive\Documentos\CUNEF\Aprendizaje_Automatico\practica1_EDA\practica1_EDA\notebooks
Lectura del dataframe¶
file_path = '../data/application_data.csv'
df_application = pd.read_csv(file_path, sep = ',') # low_memory hace que independientemente del tipo muestre los datos
df_application
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.0220 | 0.0198 | 0.0 | 0.0000 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.0790 | 0.0554 | 0.0 | 0.0000 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.0100 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | 225000.0 | Unaccompanied | Working | Secondary / secondary special | Separated | With parents | 0.032561 | -9327 | -236 | -8456.0 | -1982 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Sales staff | 1.0 | 1 | 1 | THURSDAY | 15 | 0 | 0 | 0 | 0 | 0 | 0 | Services | 0.145570 | 0.681632 | NaN | 0.2021 | 0.0887 | 0.9876 | 0.8300 | 0.0202 | 0.22 | 0.1034 | 0.6042 | 0.2708 | 0.0594 | 0.1484 | 0.1965 | 0.0753 | 0.1095 | 0.1008 | 0.0172 | 0.9782 | 0.7125 | 0.0172 | 0.0806 | 0.0345 | 0.4583 | 0.0417 | 0.0094 | 0.0882 | 0.0853 | 0.0 | 0.0125 | 0.2040 | 0.0887 | 0.9876 | 0.8323 | 0.0203 | 0.22 | 0.1034 | 0.6042 | 0.2708 | 0.0605 | 0.1509 | 0.2001 | 0.0757 | 0.1118 | reg oper account | block of flats | 0.2898 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -273.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | 225000.0 | Unaccompanied | Pensioner | Secondary / secondary special | Widow | House / apartment | 0.025164 | -20775 | 365243 | -4388.0 | -4090 | NaN | 1 | 0 | 0 | 1 | 1 | 0 | NaN | 1.0 | 2 | 2 | MONDAY | 8 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | NaN | 0.115992 | NaN | 0.0247 | 0.0435 | 0.9727 | 0.6260 | 0.0022 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0579 | 0.0202 | 0.0257 | 0.0000 | 0.0000 | 0.0252 | 0.0451 | 0.9727 | 0.6406 | 0.0022 | 0.0000 | 0.1034 | 0.0833 | 0.1250 | 0.0592 | 0.0220 | 0.0267 | 0.0 | 0.0000 | 0.0250 | 0.0435 | 0.9727 | 0.6310 | 0.0022 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0589 | 0.0205 | 0.0261 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.0214 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307508 | 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | 585000.0 | Unaccompanied | Working | Higher education | Separated | House / apartment | 0.005002 | -14966 | -7921 | -6737.0 | -5150 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | Managers | 1.0 | 3 | 3 | THURSDAY | 9 | 0 | 0 | 0 | 0 | 1 | 1 | School | 0.744026 | 0.535722 | 0.218859 | 0.1031 | 0.0862 | 0.9816 | 0.7484 | 0.0123 | 0.00 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0841 | 0.9279 | 0.0000 | 0.0000 | 0.1050 | 0.0894 | 0.9816 | 0.7583 | 0.0124 | 0.0000 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0918 | 0.9667 | 0.0 | 0.0000 | 0.1041 | 0.0862 | 0.9816 | 0.7518 | 0.0124 | 0.00 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0855 | 0.9445 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.7970 | Panel | No | 6.0 | 0.0 | 6.0 | 0.0 | -1909.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 307509 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | 319500.0 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.005313 | -11961 | -4786 | -2562.0 | -931 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 9 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 1 | NaN | 0.514163 | 0.661024 | 0.0124 | NaN | 0.9771 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0061 | NaN | NaN | 0.0126 | NaN | 0.9772 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0063 | NaN | NaN | 0.0125 | NaN | 0.9771 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0062 | NaN | NaN | NaN | block of flats | 0.0086 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -322.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307510 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | 675000.0 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.046220 | -16856 | -1262 | -5128.0 | -410 | NaN | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 2.0 | 1 | 1 | THURSDAY | 20 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.734460 | 0.708569 | 0.113922 | 0.0742 | 0.0526 | 0.9881 | NaN | 0.0176 | 0.08 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0791 | NaN | 0.0000 | 0.0756 | 0.0546 | 0.9881 | NaN | 0.0178 | 0.0806 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0824 | NaN | 0.0000 | 0.0749 | 0.0526 | 0.9881 | NaN | 0.0177 | 0.08 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0805 | NaN | 0.0000 | NaN | block of flats | 0.0718 | Panel | No | 0.0 | 0.0 | 0.0 | 0.0 | -787.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 |
307511 rows × 122 columns
Funciones¶
#Función para ver candidatas o variables categóricas
def dame_variables_categoricas(dataset=None):
'''
----------------------------------------------------------------------------------------------------------
Función dame_variables_categoricas:
----------------------------------------------------------------------------------------------------------
-Descripción: Función que recibe un dataset y devuelve una lista con los nombres de las
variables categóricas
-Inputs:
-- dataset: Pandas dataframe que contiene los datos
-Return:
-- lista_variables_categoricas: lista con los nombres de las variables categóricas del
dataset de entrada con menos de 100 valores diferentes
-- 1: la ejecución es incorrecta
'''
if dataset is None:
print(u'\nFaltan argumentos por pasar a la función')
return 1
lista_variables_categoricas = []
other = []
for i in dataset.columns:
if (dataset[i].dtype!=float) & (dataset[i].dtype!=int):
unicos = int(len(np.unique(dataset[i].dropna(axis=0, how='all'))))
if unicos < 100:
lista_variables_categoricas.append(i)
else:
other.append(i)
return lista_variables_categoricas, other
dame_variables_categoricas(df_application)
(['TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'], ['SK_ID_CURR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH'])
Análisis generales de la tabla¶
Dimensión
#Dimensión
print(df_application.shape, df_application.drop_duplicates().shape)
(307511, 122) (307511, 122)
Tipos de datos
#Tipos de datos
df_application.dtypes.to_dict()
{'SK_ID_CURR': dtype('int64'),
'TARGET': dtype('int64'),
'NAME_CONTRACT_TYPE': dtype('O'),
'CODE_GENDER': dtype('O'),
'FLAG_OWN_CAR': dtype('O'),
'FLAG_OWN_REALTY': dtype('O'),
'CNT_CHILDREN': dtype('int64'),
'AMT_INCOME_TOTAL': dtype('float64'),
'AMT_CREDIT': dtype('float64'),
'AMT_ANNUITY': dtype('float64'),
'AMT_GOODS_PRICE': dtype('float64'),
'NAME_TYPE_SUITE': dtype('O'),
'NAME_INCOME_TYPE': dtype('O'),
'NAME_EDUCATION_TYPE': dtype('O'),
'NAME_FAMILY_STATUS': dtype('O'),
'NAME_HOUSING_TYPE': dtype('O'),
'REGION_POPULATION_RELATIVE': dtype('float64'),
'DAYS_BIRTH': dtype('int64'),
'DAYS_EMPLOYED': dtype('int64'),
'DAYS_REGISTRATION': dtype('float64'),
'DAYS_ID_PUBLISH': dtype('int64'),
'OWN_CAR_AGE': dtype('float64'),
'FLAG_MOBIL': dtype('int64'),
'FLAG_EMP_PHONE': dtype('int64'),
'FLAG_WORK_PHONE': dtype('int64'),
'FLAG_CONT_MOBILE': dtype('int64'),
'FLAG_PHONE': dtype('int64'),
'FLAG_EMAIL': dtype('int64'),
'OCCUPATION_TYPE': dtype('O'),
'CNT_FAM_MEMBERS': dtype('float64'),
'REGION_RATING_CLIENT': dtype('int64'),
'REGION_RATING_CLIENT_W_CITY': dtype('int64'),
'WEEKDAY_APPR_PROCESS_START': dtype('O'),
'HOUR_APPR_PROCESS_START': dtype('int64'),
'REG_REGION_NOT_LIVE_REGION': dtype('int64'),
'REG_REGION_NOT_WORK_REGION': dtype('int64'),
'LIVE_REGION_NOT_WORK_REGION': dtype('int64'),
'REG_CITY_NOT_LIVE_CITY': dtype('int64'),
'REG_CITY_NOT_WORK_CITY': dtype('int64'),
'LIVE_CITY_NOT_WORK_CITY': dtype('int64'),
'ORGANIZATION_TYPE': dtype('O'),
'EXT_SOURCE_1': dtype('float64'),
'EXT_SOURCE_2': dtype('float64'),
'EXT_SOURCE_3': dtype('float64'),
'APARTMENTS_AVG': dtype('float64'),
'BASEMENTAREA_AVG': dtype('float64'),
'YEARS_BEGINEXPLUATATION_AVG': dtype('float64'),
'YEARS_BUILD_AVG': dtype('float64'),
'COMMONAREA_AVG': dtype('float64'),
'ELEVATORS_AVG': dtype('float64'),
'ENTRANCES_AVG': dtype('float64'),
'FLOORSMAX_AVG': dtype('float64'),
'FLOORSMIN_AVG': dtype('float64'),
'LANDAREA_AVG': dtype('float64'),
'LIVINGAPARTMENTS_AVG': dtype('float64'),
'LIVINGAREA_AVG': dtype('float64'),
'NONLIVINGAPARTMENTS_AVG': dtype('float64'),
'NONLIVINGAREA_AVG': dtype('float64'),
'APARTMENTS_MODE': dtype('float64'),
'BASEMENTAREA_MODE': dtype('float64'),
'YEARS_BEGINEXPLUATATION_MODE': dtype('float64'),
'YEARS_BUILD_MODE': dtype('float64'),
'COMMONAREA_MODE': dtype('float64'),
'ELEVATORS_MODE': dtype('float64'),
'ENTRANCES_MODE': dtype('float64'),
'FLOORSMAX_MODE': dtype('float64'),
'FLOORSMIN_MODE': dtype('float64'),
'LANDAREA_MODE': dtype('float64'),
'LIVINGAPARTMENTS_MODE': dtype('float64'),
'LIVINGAREA_MODE': dtype('float64'),
'NONLIVINGAPARTMENTS_MODE': dtype('float64'),
'NONLIVINGAREA_MODE': dtype('float64'),
'APARTMENTS_MEDI': dtype('float64'),
'BASEMENTAREA_MEDI': dtype('float64'),
'YEARS_BEGINEXPLUATATION_MEDI': dtype('float64'),
'YEARS_BUILD_MEDI': dtype('float64'),
'COMMONAREA_MEDI': dtype('float64'),
'ELEVATORS_MEDI': dtype('float64'),
'ENTRANCES_MEDI': dtype('float64'),
'FLOORSMAX_MEDI': dtype('float64'),
'FLOORSMIN_MEDI': dtype('float64'),
'LANDAREA_MEDI': dtype('float64'),
'LIVINGAPARTMENTS_MEDI': dtype('float64'),
'LIVINGAREA_MEDI': dtype('float64'),
'NONLIVINGAPARTMENTS_MEDI': dtype('float64'),
'NONLIVINGAREA_MEDI': dtype('float64'),
'FONDKAPREMONT_MODE': dtype('O'),
'HOUSETYPE_MODE': dtype('O'),
'TOTALAREA_MODE': dtype('float64'),
'WALLSMATERIAL_MODE': dtype('O'),
'EMERGENCYSTATE_MODE': dtype('O'),
'OBS_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
'DEF_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
'OBS_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
'DEF_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
'DAYS_LAST_PHONE_CHANGE': dtype('float64'),
'FLAG_DOCUMENT_2': dtype('int64'),
'FLAG_DOCUMENT_3': dtype('int64'),
'FLAG_DOCUMENT_4': dtype('int64'),
'FLAG_DOCUMENT_5': dtype('int64'),
'FLAG_DOCUMENT_6': dtype('int64'),
'FLAG_DOCUMENT_7': dtype('int64'),
'FLAG_DOCUMENT_8': dtype('int64'),
'FLAG_DOCUMENT_9': dtype('int64'),
'FLAG_DOCUMENT_10': dtype('int64'),
'FLAG_DOCUMENT_11': dtype('int64'),
'FLAG_DOCUMENT_12': dtype('int64'),
'FLAG_DOCUMENT_13': dtype('int64'),
'FLAG_DOCUMENT_14': dtype('int64'),
'FLAG_DOCUMENT_15': dtype('int64'),
'FLAG_DOCUMENT_16': dtype('int64'),
'FLAG_DOCUMENT_17': dtype('int64'),
'FLAG_DOCUMENT_18': dtype('int64'),
'FLAG_DOCUMENT_19': dtype('int64'),
'FLAG_DOCUMENT_20': dtype('int64'),
'FLAG_DOCUMENT_21': dtype('int64'),
'AMT_REQ_CREDIT_BUREAU_HOUR': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_DAY': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_WEEK': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_MON': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_QRT': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_YEAR': dtype('float64')}
A partir de un análisis general del dataframe se puede destacar que la tabla cuenta con 307511 filas (observaciones) y 122 columnas (variables de estudio). Asimismo, se encuentran datos de tipo flotante, enteros y booleanos.
Exploración de la variable objetivo y tratamiento¶
df_application['TARGET'].value_counts()
0 282686 1 24825 Name: TARGET, dtype: int64
#distrbución de la variable objetivo
df_application['TARGET'].value_counts().plot(kind='bar', title='Distribución de TARGET')
plt.show()
#exploración de la variable objetivo y tratamiento
pd_plot_target = df_application['TARGET']\
.value_counts(normalize=True)\
.mul(100).rename('percent').reset_index()
pd_plot_target_conteo = df_application['TARGET'].value_counts().reset_index()
pd_plot_target_pc = pd.merge(pd_plot_target, pd_plot_target_conteo, on=['index'], how='inner')
pd_plot_target_pc
| index | percent | TARGET | |
|---|---|---|---|
| 0 | 0 | 91.927118 | 282686 |
| 1 | 1 | 8.072882 | 24825 |
Distribución de la Variable Objetivo:
TARGET 0: Representa el 91.93% de los datos, con 282,686 registros.
TARGET 1: Representa el 8.07% de los datos, con 24,825 registros.
Esto indica que existe un desbalance significativo en la variable objetivo, donde casi el 92% de los registros pertenecen a la clase 0 y solo el 8% a la clase 1. Esto subraya la importancia de abordar en un futuro este desbalance para construir un modelo adecuado y efectivo de machine learning.
#histograma
import plotly.express as px
fig = px.histogram(pd_plot_target_pc, x="index", y=['percent'])
fig.show()
Análisis de la variable objetivo y decisión final respecto a su tratamiento¶
Posterior al análisis efectuado, se precisa que en este caso el 1 representa el cliente que tuvo pagos retrasados por más de X días en al menos una de las primeras Y cuotas del préstamo en la muestra, y 0 indica que el cliente no tuvo pagos retrasados significativos en esas primeras cuotas. La información sobre los pagos retrasados (valor 1) es fundamental para evaluar el riesgo de conceder préstamos. Los modelos de crédito utilizan estos datos para predecir la probabilidad de que un cliente incumpla en el futuro. Por este motivo, se mantienen ambos parámetros debido a que se busca generar modelos predictivos que puedan identificar patrones en el comportamiento de pago de los clientes.
Selección de threshold por filas y columnas para eliminar valores missing¶
Contar y ordenar los valores nulos por columnas y filas.
Crear DataFrames para almacenar estos conteos y porcentajes de nulos.
Filtrar las columnas que tienen menos del umbral de valores nulos.
pd_series_null_columns = df_application.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = df_application.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)
pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])
pd_null_filas = pd.DataFrame(pd_series_null_rows, columns=['nulos_filas'])
pd_null_filas['target'] = df_application['TARGET'].copy()
pd_null_columnas['porcentaje_columnas'] = pd_null_columnas['nulos_columnas']/df_application.shape[0]
pd_null_filas['porcentaje_filas']= pd_null_filas['nulos_filas']/df_application.shape[1]
## pd_null
(122,) (307511,)
df_application.shape
(307511, 122)
pd_null_columnas
| nulos_columnas | porcentaje_columnas | |
|---|---|---|
| COMMONAREA_MEDI | 214865 | 0.698723 |
| COMMONAREA_AVG | 214865 | 0.698723 |
| COMMONAREA_MODE | 214865 | 0.698723 |
| NONLIVINGAPARTMENTS_MODE | 213514 | 0.694330 |
| NONLIVINGAPARTMENTS_AVG | 213514 | 0.694330 |
| NONLIVINGAPARTMENTS_MEDI | 213514 | 0.694330 |
| FONDKAPREMONT_MODE | 210295 | 0.683862 |
| LIVINGAPARTMENTS_MODE | 210199 | 0.683550 |
| LIVINGAPARTMENTS_AVG | 210199 | 0.683550 |
| LIVINGAPARTMENTS_MEDI | 210199 | 0.683550 |
| FLOORSMIN_AVG | 208642 | 0.678486 |
| FLOORSMIN_MODE | 208642 | 0.678486 |
| FLOORSMIN_MEDI | 208642 | 0.678486 |
| YEARS_BUILD_MEDI | 204488 | 0.664978 |
| YEARS_BUILD_MODE | 204488 | 0.664978 |
| YEARS_BUILD_AVG | 204488 | 0.664978 |
| OWN_CAR_AGE | 202929 | 0.659908 |
| LANDAREA_MEDI | 182590 | 0.593767 |
| LANDAREA_MODE | 182590 | 0.593767 |
| LANDAREA_AVG | 182590 | 0.593767 |
| BASEMENTAREA_MEDI | 179943 | 0.585160 |
| BASEMENTAREA_AVG | 179943 | 0.585160 |
| BASEMENTAREA_MODE | 179943 | 0.585160 |
| EXT_SOURCE_1 | 173378 | 0.563811 |
| NONLIVINGAREA_MODE | 169682 | 0.551792 |
| NONLIVINGAREA_AVG | 169682 | 0.551792 |
| NONLIVINGAREA_MEDI | 169682 | 0.551792 |
| ELEVATORS_MEDI | 163891 | 0.532960 |
| ELEVATORS_AVG | 163891 | 0.532960 |
| ELEVATORS_MODE | 163891 | 0.532960 |
| WALLSMATERIAL_MODE | 156341 | 0.508408 |
| APARTMENTS_MEDI | 156061 | 0.507497 |
| APARTMENTS_AVG | 156061 | 0.507497 |
| APARTMENTS_MODE | 156061 | 0.507497 |
| ENTRANCES_MEDI | 154828 | 0.503488 |
| ENTRANCES_AVG | 154828 | 0.503488 |
| ENTRANCES_MODE | 154828 | 0.503488 |
| LIVINGAREA_AVG | 154350 | 0.501933 |
| LIVINGAREA_MODE | 154350 | 0.501933 |
| LIVINGAREA_MEDI | 154350 | 0.501933 |
| HOUSETYPE_MODE | 154297 | 0.501761 |
| FLOORSMAX_MODE | 153020 | 0.497608 |
| FLOORSMAX_MEDI | 153020 | 0.497608 |
| FLOORSMAX_AVG | 153020 | 0.497608 |
| YEARS_BEGINEXPLUATATION_MODE | 150007 | 0.487810 |
| YEARS_BEGINEXPLUATATION_MEDI | 150007 | 0.487810 |
| YEARS_BEGINEXPLUATATION_AVG | 150007 | 0.487810 |
| TOTALAREA_MODE | 148431 | 0.482685 |
| EMERGENCYSTATE_MODE | 145755 | 0.473983 |
| OCCUPATION_TYPE | 96391 | 0.313455 |
| EXT_SOURCE_3 | 60965 | 0.198253 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_DAY | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_MON | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_QRT | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 41519 | 0.135016 |
| NAME_TYPE_SUITE | 1292 | 0.004201 |
| OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| EXT_SOURCE_2 | 660 | 0.002146 |
| AMT_GOODS_PRICE | 278 | 0.000904 |
| AMT_ANNUITY | 12 | 0.000039 |
| CNT_FAM_MEMBERS | 2 | 0.000007 |
| DAYS_LAST_PHONE_CHANGE | 1 | 0.000003 |
| CNT_CHILDREN | 0 | 0.000000 |
| FLAG_DOCUMENT_8 | 0 | 0.000000 |
| NAME_CONTRACT_TYPE | 0 | 0.000000 |
| CODE_GENDER | 0 | 0.000000 |
| FLAG_OWN_CAR | 0 | 0.000000 |
| FLAG_DOCUMENT_2 | 0 | 0.000000 |
| FLAG_DOCUMENT_3 | 0 | 0.000000 |
| FLAG_DOCUMENT_4 | 0 | 0.000000 |
| FLAG_DOCUMENT_5 | 0 | 0.000000 |
| FLAG_DOCUMENT_6 | 0 | 0.000000 |
| FLAG_DOCUMENT_7 | 0 | 0.000000 |
| FLAG_DOCUMENT_9 | 0 | 0.000000 |
| FLAG_DOCUMENT_21 | 0 | 0.000000 |
| FLAG_DOCUMENT_10 | 0 | 0.000000 |
| FLAG_DOCUMENT_11 | 0 | 0.000000 |
| FLAG_OWN_REALTY | 0 | 0.000000 |
| FLAG_DOCUMENT_13 | 0 | 0.000000 |
| FLAG_DOCUMENT_14 | 0 | 0.000000 |
| FLAG_DOCUMENT_15 | 0 | 0.000000 |
| FLAG_DOCUMENT_16 | 0 | 0.000000 |
| FLAG_DOCUMENT_17 | 0 | 0.000000 |
| FLAG_DOCUMENT_18 | 0 | 0.000000 |
| FLAG_DOCUMENT_19 | 0 | 0.000000 |
| FLAG_DOCUMENT_20 | 0 | 0.000000 |
| FLAG_DOCUMENT_12 | 0 | 0.000000 |
| AMT_CREDIT | 0 | 0.000000 |
| AMT_INCOME_TOTAL | 0 | 0.000000 |
| FLAG_PHONE | 0 | 0.000000 |
| LIVE_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| REG_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| TARGET | 0 | 0.000000 |
| REG_CITY_NOT_LIVE_CITY | 0 | 0.000000 |
| LIVE_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_REGION_NOT_LIVE_REGION | 0 | 0.000000 |
| HOUR_APPR_PROCESS_START | 0 | 0.000000 |
| WEEKDAY_APPR_PROCESS_START | 0 | 0.000000 |
| REGION_RATING_CLIENT_W_CITY | 0 | 0.000000 |
| REGION_RATING_CLIENT | 0 | 0.000000 |
| FLAG_EMAIL | 0 | 0.000000 |
| FLAG_CONT_MOBILE | 0 | 0.000000 |
| ORGANIZATION_TYPE | 0 | 0.000000 |
| FLAG_WORK_PHONE | 0 | 0.000000 |
| FLAG_EMP_PHONE | 0 | 0.000000 |
| FLAG_MOBIL | 0 | 0.000000 |
| DAYS_ID_PUBLISH | 0 | 0.000000 |
| DAYS_REGISTRATION | 0 | 0.000000 |
| DAYS_EMPLOYED | 0 | 0.000000 |
| DAYS_BIRTH | 0 | 0.000000 |
| REGION_POPULATION_RELATIVE | 0 | 0.000000 |
| NAME_HOUSING_TYPE | 0 | 0.000000 |
| NAME_FAMILY_STATUS | 0 | 0.000000 |
| NAME_EDUCATION_TYPE | 0 | 0.000000 |
| NAME_INCOME_TYPE | 0 | 0.000000 |
| SK_ID_CURR | 0 | 0.000000 |
Porcentaje de valores nulos: A partir de la tabla anterior, se puede identificar que varias columnas del conjunto de datos tienen un elevado porcentaje de valores nulos. Por ejemplo, COMMONAREA_MEDI, COMMONAREA_AVG, y COMMONAREA_MODE tienen cerca del 69.87% de valores nulos, lo que señala que estos datos están mayoritariamente ausentes. Similarmente, NONLIVINGAPARTMENTS_MODE, NONLIVINGAPARTMENTS_AVG y NONLIVINGAPARTMENTS_MEDI son variables que tienen aproximadamente el 69.43% de valores nulos. Esto sugiere que estas variables podrían requerir de un tratamiento especial, como imputación de valores o eliminación de las columnas, si se considera que no aportan información significativa al caso de estudio. Detectar estas columnas con altos porcentajes de valores nulos es crucial para decidir cómo manejarlas en el preprocesamiento y, de este modo, asegurar la integridad del análisis subsiguiente.
Threshold=0.9: Si se obtiene exactamente el mismo número de filas y columnas después de aplicar el filtro threshold=0.9, esto indica que ninguna de las columnas tiene más del 90% de valores nulos en el DataFrame df_application. Por lo tanto, todas las columnas cumplen con el criterio y se incluyen en el DataFrame pd_target_filter_null.
threshold=0.9
list_vars_not_null = list(pd_null_columnas[pd_null_columnas['porcentaje_columnas']<threshold].index)
pd_target_filter_null = df_application.loc[:, list_vars_not_null]
pd_target_filter_null.shape
(307511, 122)
pd_null_filas
| nulos_filas | TARGET | porcentaje_filas | |
|---|---|---|---|
| 185713 | 61 | 0 | 0.5 |
| 133770 | 61 | 0 | 0.5 |
| 197736 | 61 | 0 | 0.5 |
| 116937 | 61 | 0 | 0.5 |
| 269492 | 61 | 0 | 0.5 |
| ... | ... | ... | ... |
| 129942 | 0 | 0 | 0.0 |
| 129929 | 0 | 0 | 0.0 |
| 129924 | 0 | 0 | 0.0 |
| 129911 | 0 | 0 | 0.0 |
| 153755 | 0 | 0 | 0.0 |
307511 rows × 3 columns
Tipos de variables categóricas y númericas¶
Se convierten las variables categóricas a tipo "category" y se muestran las primeras filas del DataFrame filtrado y convertido.
list_cat_vars, other = dame_variables_categoricas(dataset=pd_target_filter_null)
pd_target_filter_null[list_cat_vars] = pd_target_filter_null[list_cat_vars].astype("category")
pd_target_filter_null[list_cat_vars].head()
| FONDKAPREMONT_MODE | WALLSMATERIAL_MODE | HOUSETYPE_MODE | EMERGENCYSTATE_MODE | OCCUPATION_TYPE | NAME_TYPE_SUITE | CNT_CHILDREN | FLAG_DOCUMENT_8 | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_21 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_OWN_REALTY | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_12 | FLAG_PHONE | LIVE_CITY_NOT_WORK_CITY | REG_CITY_NOT_WORK_CITY | TARGET | REG_CITY_NOT_LIVE_CITY | LIVE_REGION_NOT_WORK_REGION | REG_REGION_NOT_WORK_REGION | REG_REGION_NOT_LIVE_REGION | HOUR_APPR_PROCESS_START | WEEKDAY_APPR_PROCESS_START | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | FLAG_EMAIL | FLAG_CONT_MOBILE | ORGANIZATION_TYPE | FLAG_WORK_PHONE | FLAG_EMP_PHONE | FLAG_MOBIL | NAME_HOUSING_TYPE | NAME_FAMILY_STATUS | NAME_EDUCATION_TYPE | NAME_INCOME_TYPE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | reg oper account | Stone, brick | block of flats | No | Laborers | Unaccompanied | 0 | 0 | Cash loans | M | N | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 10 | WEDNESDAY | 2 | 2 | 0 | 1 | Business Entity Type 3 | 0 | 1 | 1 | House / apartment | Single / not married | Secondary / secondary special | Working |
| 1 | reg oper account | Block | block of flats | No | Core staff | Family | 0 | 0 | Cash loans | F | N | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | N | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | MONDAY | 1 | 1 | 0 | 1 | School | 0 | 1 | 1 | House / apartment | Married | Higher education | State servant |
| 2 | NaN | NaN | NaN | NaN | Laborers | Unaccompanied | 0 | 0 | Revolving loans | M | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | MONDAY | 2 | 2 | 0 | 1 | Government | 1 | 1 | 1 | House / apartment | Single / not married | Secondary / secondary special | Working |
| 3 | NaN | NaN | NaN | NaN | Laborers | Unaccompanied | 0 | 0 | Cash loans | F | N | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17 | WEDNESDAY | 2 | 2 | 0 | 1 | Business Entity Type 3 | 0 | 1 | 1 | House / apartment | Civil marriage | Secondary / secondary special | Working |
| 4 | NaN | NaN | NaN | NaN | Core staff | Unaccompanied | 0 | 1 | Cash loans | M | N | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 11 | THURSDAY | 2 | 2 | 0 | 1 | Religion | 0 | 1 | 1 | House / apartment | Single / not married | Secondary / secondary special | Working |
list_cat_vars
['FONDKAPREMONT_MODE', 'WALLSMATERIAL_MODE', 'HOUSETYPE_MODE', 'EMERGENCYSTATE_MODE', 'OCCUPATION_TYPE', 'NAME_TYPE_SUITE', 'CNT_CHILDREN', 'FLAG_DOCUMENT_8', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_21', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_OWN_REALTY', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_12', 'FLAG_PHONE', 'LIVE_CITY_NOT_WORK_CITY', 'REG_CITY_NOT_WORK_CITY', 'TARGET', 'REG_CITY_NOT_LIVE_CITY', 'LIVE_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_LIVE_REGION', 'HOUR_APPR_PROCESS_START', 'WEEKDAY_APPR_PROCESS_START', 'REGION_RATING_CLIENT_W_CITY', 'REGION_RATING_CLIENT', 'FLAG_EMAIL', 'FLAG_CONT_MOBILE', 'ORGANIZATION_TYPE', 'FLAG_WORK_PHONE', 'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'NAME_HOUSING_TYPE', 'NAME_FAMILY_STATUS', 'NAME_EDUCATION_TYPE', 'NAME_INCOME_TYPE']
Se cuenta la frecuencia de cada categoría en la columna HOUSETYPE_MODE.
pd_target_filter_null['HOUSETYPE_MODE'].value_counts()
block of flats 150503 specific housing 1499 terraced house 1212 Name: HOUSETYPE_MODE, dtype: int64
Se cuenta la frecuencia de cada categoría en la columna FONDKAPREMONT_MODE.
pd_target_filter_null['FONDKAPREMONT_MODE'].value_counts()
reg oper account 73830 reg oper spec account 12080 not specified 5687 org spec account 5619 Name: FONDKAPREMONT_MODE, dtype: int64
pd_target_filter_null[list_cat_vars].dtypes
FONDKAPREMONT_MODE category WALLSMATERIAL_MODE category HOUSETYPE_MODE category EMERGENCYSTATE_MODE category OCCUPATION_TYPE category NAME_TYPE_SUITE category CNT_CHILDREN category FLAG_DOCUMENT_8 category NAME_CONTRACT_TYPE category CODE_GENDER category FLAG_OWN_CAR category FLAG_DOCUMENT_2 category FLAG_DOCUMENT_3 category FLAG_DOCUMENT_4 category FLAG_DOCUMENT_5 category FLAG_DOCUMENT_6 category FLAG_DOCUMENT_7 category FLAG_DOCUMENT_9 category FLAG_DOCUMENT_21 category FLAG_DOCUMENT_10 category FLAG_DOCUMENT_11 category FLAG_OWN_REALTY category FLAG_DOCUMENT_13 category FLAG_DOCUMENT_14 category FLAG_DOCUMENT_15 category FLAG_DOCUMENT_16 category FLAG_DOCUMENT_17 category FLAG_DOCUMENT_18 category FLAG_DOCUMENT_19 category FLAG_DOCUMENT_20 category FLAG_DOCUMENT_12 category FLAG_PHONE category LIVE_CITY_NOT_WORK_CITY category REG_CITY_NOT_WORK_CITY category TARGET category REG_CITY_NOT_LIVE_CITY category LIVE_REGION_NOT_WORK_REGION category REG_REGION_NOT_WORK_REGION category REG_REGION_NOT_LIVE_REGION category HOUR_APPR_PROCESS_START category WEEKDAY_APPR_PROCESS_START category REGION_RATING_CLIENT_W_CITY category REGION_RATING_CLIENT category FLAG_EMAIL category FLAG_CONT_MOBILE category ORGANIZATION_TYPE category FLAG_WORK_PHONE category FLAG_EMP_PHONE category FLAG_MOBIL category NAME_HOUSING_TYPE category NAME_FAMILY_STATUS category NAME_EDUCATION_TYPE category NAME_INCOME_TYPE category dtype: object
pd_target_filter_null[other].head(10)
| DAYS_ID_PUBLISH | DAYS_EMPLOYED | DAYS_BIRTH | SK_ID_CURR | |
|---|---|---|---|---|
| 0 | -2120 | -637 | -9461 | 100002 |
| 1 | -291 | -1188 | -16765 | 100003 |
| 2 | -2531 | -225 | -19046 | 100004 |
| 3 | -2437 | -3039 | -19005 | 100006 |
| 4 | -3458 | -3038 | -19932 | 100007 |
| 5 | -477 | -1588 | -16941 | 100008 |
| 6 | -619 | -3130 | -13778 | 100009 |
| 7 | -2379 | -449 | -18850 | 100010 |
| 8 | -3514 | 365243 | -20099 | 100011 |
| 9 | -3992 | -2019 | -14469 | 100012 |
print(pd_target_filter_null["DAYS_ID_PUBLISH"].value_counts().count())
pd_target_filter_null["DAYS_ID_PUBLISH"]\
.apply(lambda x: str(x).lower().strip()).value_counts(normalize=True)#.count()
6168
-4053 0.000550
-4095 0.000527
-4046 0.000524
-4417 0.000517
-4256 0.000514
...
-6216 0.000003
-6115 0.000003
-6197 0.000003
-5652 0.000003
-6129 0.000003
Name: DAYS_ID_PUBLISH, Length: 6168, dtype: float64
Ambas líneas devuelven el número 6168, indicando que antes y después de la normalización, hay 6168 valores únicos en DAYS_ID_PUBLISH.
print(pd_target_filter_null["DAYS_ID_PUBLISH"].value_counts().count())
pd_target_filter_null["DAYS_ID_PUBLISH"]\
.apply(lambda x: str(x).lower().strip()).value_counts(normalize=True).count()
6168
6168
En este apartado se convirtieron las variables categóricas a tipo "category", se guardaron en el dataframe pd_target_filter_null y se verificó que estuvieran normalizadas. Esto es un paso muy relevante para el análisis ya que posibilitará tratar de manera adecuada las variables categóricas y, de igual manera, facilitará la aplicación de técnicas de codificación y escalado necesarias para el modelado predictivo, asegurando la integridad y la consistencia de los resultados obtenidos.
Preprocesamiento inicial de algunas variables¶
# Verificar las Columnas del DataFrame
print(pd_target_filter_null.columns)
Index(['COMMONAREA_MEDI', 'COMMONAREA_AVG', 'COMMONAREA_MODE',
'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAPARTMENTS_MEDI', 'FONDKAPREMONT_MODE',
'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_AVG',
'LIVINGAPARTMENTS_MEDI',
...
'DAYS_ID_PUBLISH', 'DAYS_REGISTRATION', 'DAYS_EMPLOYED', 'DAYS_BIRTH',
'REGION_POPULATION_RELATIVE', 'NAME_HOUSING_TYPE', 'NAME_FAMILY_STATUS',
'NAME_EDUCATION_TYPE', 'NAME_INCOME_TYPE', 'SK_ID_CURR'],
dtype='object', length=122)
# Verificar Tipos de Datos
print(pd_target_filter_null.dtypes)
COMMONAREA_MEDI float64 COMMONAREA_AVG float64 COMMONAREA_MODE float64 NONLIVINGAPARTMENTS_MODE float64 NONLIVINGAPARTMENTS_AVG float64 NONLIVINGAPARTMENTS_MEDI float64 FONDKAPREMONT_MODE category LIVINGAPARTMENTS_MODE float64 LIVINGAPARTMENTS_AVG float64 LIVINGAPARTMENTS_MEDI float64 FLOORSMIN_AVG float64 FLOORSMIN_MODE float64 FLOORSMIN_MEDI float64 YEARS_BUILD_MEDI float64 YEARS_BUILD_MODE float64 YEARS_BUILD_AVG float64 OWN_CAR_AGE float64 LANDAREA_MEDI float64 LANDAREA_MODE float64 LANDAREA_AVG float64 BASEMENTAREA_MEDI float64 BASEMENTAREA_AVG float64 BASEMENTAREA_MODE float64 EXT_SOURCE_1 float64 NONLIVINGAREA_MODE float64 NONLIVINGAREA_AVG float64 NONLIVINGAREA_MEDI float64 ELEVATORS_MEDI float64 ELEVATORS_AVG float64 ELEVATORS_MODE float64 WALLSMATERIAL_MODE category APARTMENTS_MEDI float64 APARTMENTS_AVG float64 APARTMENTS_MODE float64 ENTRANCES_MEDI float64 ENTRANCES_AVG float64 ENTRANCES_MODE float64 LIVINGAREA_AVG float64 LIVINGAREA_MODE float64 LIVINGAREA_MEDI float64 HOUSETYPE_MODE category FLOORSMAX_MODE float64 FLOORSMAX_MEDI float64 FLOORSMAX_AVG float64 YEARS_BEGINEXPLUATATION_MODE float64 YEARS_BEGINEXPLUATATION_MEDI float64 YEARS_BEGINEXPLUATATION_AVG float64 TOTALAREA_MODE float64 EMERGENCYSTATE_MODE category OCCUPATION_TYPE category EXT_SOURCE_3 float64 AMT_REQ_CREDIT_BUREAU_HOUR float64 AMT_REQ_CREDIT_BUREAU_DAY float64 AMT_REQ_CREDIT_BUREAU_WEEK float64 AMT_REQ_CREDIT_BUREAU_MON float64 AMT_REQ_CREDIT_BUREAU_QRT float64 AMT_REQ_CREDIT_BUREAU_YEAR float64 NAME_TYPE_SUITE category OBS_30_CNT_SOCIAL_CIRCLE float64 DEF_30_CNT_SOCIAL_CIRCLE float64 OBS_60_CNT_SOCIAL_CIRCLE float64 DEF_60_CNT_SOCIAL_CIRCLE float64 EXT_SOURCE_2 float64 AMT_GOODS_PRICE float64 AMT_ANNUITY float64 CNT_FAM_MEMBERS float64 DAYS_LAST_PHONE_CHANGE float64 CNT_CHILDREN category FLAG_DOCUMENT_8 category NAME_CONTRACT_TYPE category CODE_GENDER category FLAG_OWN_CAR category FLAG_DOCUMENT_2 category FLAG_DOCUMENT_3 category FLAG_DOCUMENT_4 category FLAG_DOCUMENT_5 category FLAG_DOCUMENT_6 category FLAG_DOCUMENT_7 category FLAG_DOCUMENT_9 category FLAG_DOCUMENT_21 category FLAG_DOCUMENT_10 category FLAG_DOCUMENT_11 category FLAG_OWN_REALTY category FLAG_DOCUMENT_13 category FLAG_DOCUMENT_14 category FLAG_DOCUMENT_15 category FLAG_DOCUMENT_16 category FLAG_DOCUMENT_17 category FLAG_DOCUMENT_18 category FLAG_DOCUMENT_19 category FLAG_DOCUMENT_20 category FLAG_DOCUMENT_12 category AMT_CREDIT float64 AMT_INCOME_TOTAL float64 FLAG_PHONE category LIVE_CITY_NOT_WORK_CITY category REG_CITY_NOT_WORK_CITY category TARGET category REG_CITY_NOT_LIVE_CITY category LIVE_REGION_NOT_WORK_REGION category REG_REGION_NOT_WORK_REGION category REG_REGION_NOT_LIVE_REGION category HOUR_APPR_PROCESS_START category WEEKDAY_APPR_PROCESS_START category REGION_RATING_CLIENT_W_CITY category REGION_RATING_CLIENT category FLAG_EMAIL category FLAG_CONT_MOBILE category ORGANIZATION_TYPE category FLAG_WORK_PHONE category FLAG_EMP_PHONE category FLAG_MOBIL category DAYS_ID_PUBLISH int64 DAYS_REGISTRATION float64 DAYS_EMPLOYED int64 DAYS_BIRTH int64 REGION_POPULATION_RELATIVE float64 NAME_HOUSING_TYPE category NAME_FAMILY_STATUS category NAME_EDUCATION_TYPE category NAME_INCOME_TYPE category SK_ID_CURR int64 dtype: object
import pandas as pd
# Contar valores NaN en cada columna
nan_counts = pd_target_filter_null.isnull().sum()
# Mostrar el conteo de NaN por columna
print(nan_counts)
COMMONAREA_MEDI 214865 COMMONAREA_AVG 214865 COMMONAREA_MODE 214865 NONLIVINGAPARTMENTS_MODE 213514 NONLIVINGAPARTMENTS_AVG 213514 NONLIVINGAPARTMENTS_MEDI 213514 FONDKAPREMONT_MODE 210295 LIVINGAPARTMENTS_MODE 210199 LIVINGAPARTMENTS_AVG 210199 LIVINGAPARTMENTS_MEDI 210199 FLOORSMIN_AVG 208642 FLOORSMIN_MODE 208642 FLOORSMIN_MEDI 208642 YEARS_BUILD_MEDI 204488 YEARS_BUILD_MODE 204488 YEARS_BUILD_AVG 204488 OWN_CAR_AGE 202929 LANDAREA_MEDI 182590 LANDAREA_MODE 182590 LANDAREA_AVG 182590 BASEMENTAREA_MEDI 179943 BASEMENTAREA_AVG 179943 BASEMENTAREA_MODE 179943 EXT_SOURCE_1 173378 NONLIVINGAREA_MODE 169682 NONLIVINGAREA_AVG 169682 NONLIVINGAREA_MEDI 169682 ELEVATORS_MEDI 163891 ELEVATORS_AVG 163891 ELEVATORS_MODE 163891 WALLSMATERIAL_MODE 156341 APARTMENTS_MEDI 156061 APARTMENTS_AVG 156061 APARTMENTS_MODE 156061 ENTRANCES_MEDI 154828 ENTRANCES_AVG 154828 ENTRANCES_MODE 154828 LIVINGAREA_AVG 154350 LIVINGAREA_MODE 154350 LIVINGAREA_MEDI 154350 HOUSETYPE_MODE 154297 FLOORSMAX_MODE 153020 FLOORSMAX_MEDI 153020 FLOORSMAX_AVG 153020 YEARS_BEGINEXPLUATATION_MODE 150007 YEARS_BEGINEXPLUATATION_MEDI 150007 YEARS_BEGINEXPLUATATION_AVG 150007 TOTALAREA_MODE 148431 EMERGENCYSTATE_MODE 145755 OCCUPATION_TYPE 96391 EXT_SOURCE_3 60965 AMT_REQ_CREDIT_BUREAU_HOUR 41519 AMT_REQ_CREDIT_BUREAU_DAY 41519 AMT_REQ_CREDIT_BUREAU_WEEK 41519 AMT_REQ_CREDIT_BUREAU_MON 41519 AMT_REQ_CREDIT_BUREAU_QRT 41519 AMT_REQ_CREDIT_BUREAU_YEAR 41519 NAME_TYPE_SUITE 1292 OBS_30_CNT_SOCIAL_CIRCLE 1021 DEF_30_CNT_SOCIAL_CIRCLE 1021 OBS_60_CNT_SOCIAL_CIRCLE 1021 DEF_60_CNT_SOCIAL_CIRCLE 1021 EXT_SOURCE_2 660 AMT_GOODS_PRICE 278 AMT_ANNUITY 12 CNT_FAM_MEMBERS 2 DAYS_LAST_PHONE_CHANGE 1 CNT_CHILDREN 0 FLAG_DOCUMENT_8 0 NAME_CONTRACT_TYPE 0 CODE_GENDER 0 FLAG_OWN_CAR 0 FLAG_DOCUMENT_2 0 FLAG_DOCUMENT_3 0 FLAG_DOCUMENT_4 0 FLAG_DOCUMENT_5 0 FLAG_DOCUMENT_6 0 FLAG_DOCUMENT_7 0 FLAG_DOCUMENT_9 0 FLAG_DOCUMENT_21 0 FLAG_DOCUMENT_10 0 FLAG_DOCUMENT_11 0 FLAG_OWN_REALTY 0 FLAG_DOCUMENT_13 0 FLAG_DOCUMENT_14 0 FLAG_DOCUMENT_15 0 FLAG_DOCUMENT_16 0 FLAG_DOCUMENT_17 0 FLAG_DOCUMENT_18 0 FLAG_DOCUMENT_19 0 FLAG_DOCUMENT_20 0 FLAG_DOCUMENT_12 0 AMT_CREDIT 0 AMT_INCOME_TOTAL 0 FLAG_PHONE 0 LIVE_CITY_NOT_WORK_CITY 0 REG_CITY_NOT_WORK_CITY 0 TARGET 0 REG_CITY_NOT_LIVE_CITY 0 LIVE_REGION_NOT_WORK_REGION 0 REG_REGION_NOT_WORK_REGION 0 REG_REGION_NOT_LIVE_REGION 0 HOUR_APPR_PROCESS_START 0 WEEKDAY_APPR_PROCESS_START 0 REGION_RATING_CLIENT_W_CITY 0 REGION_RATING_CLIENT 0 FLAG_EMAIL 0 FLAG_CONT_MOBILE 0 ORGANIZATION_TYPE 0 FLAG_WORK_PHONE 0 FLAG_EMP_PHONE 0 FLAG_MOBIL 0 DAYS_ID_PUBLISH 0 DAYS_REGISTRATION 0 DAYS_EMPLOYED 0 DAYS_BIRTH 0 REGION_POPULATION_RELATIVE 0 NAME_HOUSING_TYPE 0 NAME_FAMILY_STATUS 0 NAME_EDUCATION_TYPE 0 NAME_INCOME_TYPE 0 SK_ID_CURR 0 dtype: int64
Imprime el resultado del conteo de NaN para que el analista pueda identificar qué columnas tienen valores faltantes y cuántos.
Es importante debido a que,
- Permite identificar columnas con un alto porcentaje de datos faltantes.
- Proporciona una guía para priorizar el tratamiento de estas columnas.
# Preprocesamiento de columnas con porcentajes manteniendo nulos
mask_int_rate = pd_target_filter_null['COMMONAREA_MEDI'].notna()
mask_revol = pd_target_filter_null['COMMONAREA_AVG'].notna()
pd_target_filter_null.loc[mask_int_rate, 'COMMONAREA_MEDI'] = pd_target_filter_null.loc[mask_int_rate, 'COMMONAREA_MEDI'].astype(str).str.replace('%', '')
pd_target_filter_null.loc[mask_revol, 'COMMONAREA_AVG'] = pd_target_filter_null.loc[mask_revol, 'COMMONAREA_AVG'].astype(str).str.replace('%', '')
pd_target_filter_null.loc[:,['COMMONAREA_MEDI', 'COMMONAREA_AVG']] = pd_target_filter_null.loc[:,['COMMONAREA_MEDI', 'COMMONAREA_AVG']].apply(pd.to_numeric, errors='coerce')
# Procesamiento de columnas numéricas preservando nulos
pd_target_filter_null['NONLIVINGAPARTMENTS_MODE'] = pd.to_numeric(pd_target_filter_null['NONLIVINGAPARTMENTS_MODE'], errors='coerce')
pd_target_filter_null['LIVINGAPARTMENTS_MODE'] = pd.to_numeric(pd_target_filter_null['LIVINGAPARTMENTS_MODE'], errors='coerce')
pd_target_filter_null['LIVINGAPARTMENTS_AVG'] = pd.to_numeric(pd_target_filter_null['LIVINGAPARTMENTS_AVG'], errors='coerce')
# Procesamiento de columnas categóricas manteniendo nulos
categorical_columns = ['NAME_HOUSING_TYPE', 'NAME_FAMILY_STATUS', 'NAME_EDUCATION_TYPE', 'NAME_INCOME_TYPE']
for col in categorical_columns:
pd_target_filter_null[col] = pd_target_filter_null[col].astype('category')
# Procesamiento de fechas preservando nulos
date_columns = ['DAYS_ID_PUBLISH', 'DAYS_REGISTRATION', 'DAYS_EMPLOYED', 'DAYS_BIRTH']
for col in date_columns:
pd_target_filter_null[col] = pd.to_numeric(pd_target_filter_null[col], errors='coerce')
# Si hay columnas con meses, mantenemos la estructura original pero protegiendo nulos
months = {
'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12
}
# Solo si existen columnas con fechas en formato mes-año
if 'application_date' in pd_target_filter_null.columns:
mask_dates = pd_target_filter_null['application_date'].notna()
pd_target_filter_null.loc[mask_dates, 'application_date_month'] = pd_target_filter_null.loc[mask_dates, 'application_date']\
.apply(lambda x: x.split('-')[0])\
.replace(months)
pd_target_filter_null.loc[mask_dates, 'application_date_year'] = pd_target_filter_null.loc[mask_dates, 'application_date']\
.apply(lambda x: x.split('-')[1])
pd_target_filter_null['application_date_month'] = pd.to_numeric(pd_target_filter_null['application_date_month'], errors='coerce')
pd_target_filter_null['application_date_year'] = pd.to_numeric(pd_target_filter_null['application_date_year'], errors='coerce')
Cada columna se maneja de acuerdo a su categoría: las columnas con porcentajes se transforman en valores numéricos, suprimiendo símbolos no deseados y manteniendo los valores nulos; las columnas categóricas se convierten en categoría para maximizar el uso de memoria y simplificar el análisis de frecuencias; y las fechas o valores temporales son tratadas numéricamente, facilitando cálculos matemáticos y análisis de tiempo más exactos. Adicionalmente, las columnas con meses y años en formato texto se fragmentan en dos variables distintas (mes y año), lo que incrementa su interpretación y beneficio en análisis de tiempo. La perspectiva contempla la conservación de valores nulos en todas las transformaciones, lo que resulta crucial para una gestión apropiada de los datos ausentes en etapas subsiguientes. Esta práctica garantiza que los datos se mantengan organizados y coherentes, al mismo tiempo que se disminuyen los errores al manejar diversos tipos de datos.
# Guardar conteo de nulos antes
nulos_antes = pd_target_filter_null.isnull().sum()
# Ejecutar el código de transformación
# Verificar nulos después
nulos_despues = pd_target_filter_null.isnull().sum()
# Comparar
print(nulos_antes == nulos_despues)
COMMONAREA_MEDI True COMMONAREA_AVG True COMMONAREA_MODE True NONLIVINGAPARTMENTS_MODE True NONLIVINGAPARTMENTS_AVG True NONLIVINGAPARTMENTS_MEDI True FONDKAPREMONT_MODE True LIVINGAPARTMENTS_MODE True LIVINGAPARTMENTS_AVG True LIVINGAPARTMENTS_MEDI True FLOORSMIN_AVG True FLOORSMIN_MODE True FLOORSMIN_MEDI True YEARS_BUILD_MEDI True YEARS_BUILD_MODE True YEARS_BUILD_AVG True OWN_CAR_AGE True LANDAREA_MEDI True LANDAREA_MODE True LANDAREA_AVG True BASEMENTAREA_MEDI True BASEMENTAREA_AVG True BASEMENTAREA_MODE True EXT_SOURCE_1 True NONLIVINGAREA_MODE True NONLIVINGAREA_AVG True NONLIVINGAREA_MEDI True ELEVATORS_MEDI True ELEVATORS_AVG True ELEVATORS_MODE True WALLSMATERIAL_MODE True APARTMENTS_MEDI True APARTMENTS_AVG True APARTMENTS_MODE True ENTRANCES_MEDI True ENTRANCES_AVG True ENTRANCES_MODE True LIVINGAREA_AVG True LIVINGAREA_MODE True LIVINGAREA_MEDI True HOUSETYPE_MODE True FLOORSMAX_MODE True FLOORSMAX_MEDI True FLOORSMAX_AVG True YEARS_BEGINEXPLUATATION_MODE True YEARS_BEGINEXPLUATATION_MEDI True YEARS_BEGINEXPLUATATION_AVG True TOTALAREA_MODE True EMERGENCYSTATE_MODE True OCCUPATION_TYPE True EXT_SOURCE_3 True AMT_REQ_CREDIT_BUREAU_HOUR True AMT_REQ_CREDIT_BUREAU_DAY True AMT_REQ_CREDIT_BUREAU_WEEK True AMT_REQ_CREDIT_BUREAU_MON True AMT_REQ_CREDIT_BUREAU_QRT True AMT_REQ_CREDIT_BUREAU_YEAR True NAME_TYPE_SUITE True OBS_30_CNT_SOCIAL_CIRCLE True DEF_30_CNT_SOCIAL_CIRCLE True OBS_60_CNT_SOCIAL_CIRCLE True DEF_60_CNT_SOCIAL_CIRCLE True EXT_SOURCE_2 True AMT_GOODS_PRICE True AMT_ANNUITY True CNT_FAM_MEMBERS True DAYS_LAST_PHONE_CHANGE True CNT_CHILDREN True FLAG_DOCUMENT_8 True NAME_CONTRACT_TYPE True CODE_GENDER True FLAG_OWN_CAR True FLAG_DOCUMENT_2 True FLAG_DOCUMENT_3 True FLAG_DOCUMENT_4 True FLAG_DOCUMENT_5 True FLAG_DOCUMENT_6 True FLAG_DOCUMENT_7 True FLAG_DOCUMENT_9 True FLAG_DOCUMENT_21 True FLAG_DOCUMENT_10 True FLAG_DOCUMENT_11 True FLAG_OWN_REALTY True FLAG_DOCUMENT_13 True FLAG_DOCUMENT_14 True FLAG_DOCUMENT_15 True FLAG_DOCUMENT_16 True FLAG_DOCUMENT_17 True FLAG_DOCUMENT_18 True FLAG_DOCUMENT_19 True FLAG_DOCUMENT_20 True FLAG_DOCUMENT_12 True AMT_CREDIT True AMT_INCOME_TOTAL True FLAG_PHONE True LIVE_CITY_NOT_WORK_CITY True REG_CITY_NOT_WORK_CITY True TARGET True REG_CITY_NOT_LIVE_CITY True LIVE_REGION_NOT_WORK_REGION True REG_REGION_NOT_WORK_REGION True REG_REGION_NOT_LIVE_REGION True HOUR_APPR_PROCESS_START True WEEKDAY_APPR_PROCESS_START True REGION_RATING_CLIENT_W_CITY True REGION_RATING_CLIENT True FLAG_EMAIL True FLAG_CONT_MOBILE True ORGANIZATION_TYPE True FLAG_WORK_PHONE True FLAG_EMP_PHONE True FLAG_MOBIL True DAYS_ID_PUBLISH True DAYS_REGISTRATION True DAYS_EMPLOYED True DAYS_BIRTH True REGION_POPULATION_RELATIVE True NAME_HOUSING_TYPE True NAME_FAMILY_STATUS True NAME_EDUCATION_TYPE True NAME_INCOME_TYPE True SK_ID_CURR True dtype: bool
El contraste entre los valores nulos previos y posteriores al procesamiento revela que no se registraron variaciones en la cantidad de valores nulos para ninguna de las columnas del dataset. Esto asegura que los cambios realizados en el preprocesamiento mantienen adecuadamente los valores ausentes, sin modificarlos ni suprimirlos de manera accidental durante las operaciones. Esta conservación garantiza que el preprocesamiento se concentró exclusivamente en cambiar los datos existentes (como transformar porcentajes en valores numéricos, gestionar categorías o desglosar fechas) sin alterar la estructura original del dataset. Este método posibilita el trabajo seguro, preservando la integridad de los datos y asegurando que los análisis posteriores representen con precisión la calidad inicial del conjunto de datos.
pd_target_filter_null.head()
| COMMONAREA_MEDI | COMMONAREA_AVG | COMMONAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAPARTMENTS_AVG | NONLIVINGAPARTMENTS_MEDI | FONDKAPREMONT_MODE | LIVINGAPARTMENTS_MODE | LIVINGAPARTMENTS_AVG | LIVINGAPARTMENTS_MEDI | FLOORSMIN_AVG | FLOORSMIN_MODE | FLOORSMIN_MEDI | YEARS_BUILD_MEDI | YEARS_BUILD_MODE | YEARS_BUILD_AVG | OWN_CAR_AGE | LANDAREA_MEDI | LANDAREA_MODE | LANDAREA_AVG | BASEMENTAREA_MEDI | BASEMENTAREA_AVG | BASEMENTAREA_MODE | EXT_SOURCE_1 | NONLIVINGAREA_MODE | NONLIVINGAREA_AVG | NONLIVINGAREA_MEDI | ELEVATORS_MEDI | ELEVATORS_AVG | ELEVATORS_MODE | WALLSMATERIAL_MODE | APARTMENTS_MEDI | APARTMENTS_AVG | APARTMENTS_MODE | ENTRANCES_MEDI | ENTRANCES_AVG | ENTRANCES_MODE | LIVINGAREA_AVG | LIVINGAREA_MODE | LIVINGAREA_MEDI | HOUSETYPE_MODE | FLOORSMAX_MODE | FLOORSMAX_MEDI | FLOORSMAX_AVG | YEARS_BEGINEXPLUATATION_MODE | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_AVG | TOTALAREA_MODE | EMERGENCYSTATE_MODE | OCCUPATION_TYPE | EXT_SOURCE_3 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | NAME_TYPE_SUITE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | EXT_SOURCE_2 | AMT_GOODS_PRICE | AMT_ANNUITY | CNT_FAM_MEMBERS | DAYS_LAST_PHONE_CHANGE | CNT_CHILDREN | FLAG_DOCUMENT_8 | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_21 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_OWN_REALTY | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_12 | AMT_CREDIT | AMT_INCOME_TOTAL | FLAG_PHONE | LIVE_CITY_NOT_WORK_CITY | REG_CITY_NOT_WORK_CITY | TARGET | REG_CITY_NOT_LIVE_CITY | LIVE_REGION_NOT_WORK_REGION | REG_REGION_NOT_WORK_REGION | REG_REGION_NOT_LIVE_REGION | HOUR_APPR_PROCESS_START | WEEKDAY_APPR_PROCESS_START | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | FLAG_EMAIL | FLAG_CONT_MOBILE | ORGANIZATION_TYPE | FLAG_WORK_PHONE | FLAG_EMP_PHONE | FLAG_MOBIL | DAYS_ID_PUBLISH | DAYS_REGISTRATION | DAYS_EMPLOYED | DAYS_BIRTH | REGION_POPULATION_RELATIVE | NAME_HOUSING_TYPE | NAME_FAMILY_STATUS | NAME_EDUCATION_TYPE | NAME_INCOME_TYPE | SK_ID_CURR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0144 | 0.0143 | 0.0144 | 0.0 | 0.0000 | 0.0000 | reg oper account | 0.022 | 0.0202 | 0.0205 | 0.1250 | 0.1250 | 0.1250 | 0.6243 | 0.6341 | 0.6192 | NaN | 0.0375 | 0.0377 | 0.0369 | 0.0369 | 0.0369 | 0.0383 | 0.083037 | 0.0 | 0.0000 | 0.00 | 0.00 | 0.00 | 0.0000 | Stone, brick | 0.0250 | 0.0247 | 0.0252 | 0.0690 | 0.0690 | 0.0690 | 0.0190 | 0.0198 | 0.0193 | block of flats | 0.0833 | 0.0833 | 0.0833 | 0.9722 | 0.9722 | 0.9722 | 0.0149 | No | Laborers | 0.139376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | Unaccompanied | 2.0 | 2.0 | 2.0 | 2.0 | 0.262949 | 351000.0 | 24700.5 | 1.0 | -1134.0 | 0 | 0 | Cash loans | M | N | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 406597.5 | 202500.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 10 | WEDNESDAY | 2 | 2 | 0 | 1 | Business Entity Type 3 | 0 | 1 | 1 | -2120 | -3648.0 | -637 | -9461 | 0.018801 | House / apartment | Single / not married | Secondary / secondary special | Working | 100002 |
| 1 | 0.0608 | 0.0605 | 0.0497 | 0.0 | 0.0039 | 0.0039 | reg oper account | 0.079 | 0.0773 | 0.0787 | 0.3333 | 0.3333 | 0.3333 | 0.7987 | 0.8040 | 0.7960 | NaN | 0.0132 | 0.0128 | 0.0130 | 0.0529 | 0.0529 | 0.0538 | 0.311267 | 0.0 | 0.0098 | 0.01 | 0.08 | 0.08 | 0.0806 | Block | 0.0968 | 0.0959 | 0.0924 | 0.0345 | 0.0345 | 0.0345 | 0.0549 | 0.0554 | 0.0558 | block of flats | 0.2917 | 0.2917 | 0.2917 | 0.9851 | 0.9851 | 0.9851 | 0.0714 | No | Core staff | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Family | 1.0 | 0.0 | 1.0 | 0.0 | 0.622246 | 1129500.0 | 35698.5 | 2.0 | -828.0 | 0 | 0 | Cash loans | F | N | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | N | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1293502.5 | 270000.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | MONDAY | 1 | 1 | 0 | 1 | School | 0 | 1 | 1 | -291 | -1186.0 | -1188 | -16765 | 0.003541 | House / apartment | Married | Higher education | State servant | 100003 |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 26.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Laborers | 0.729567 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Unaccompanied | 0.0 | 0.0 | 0.0 | 0.0 | 0.555912 | 135000.0 | 6750.0 | 1.0 | -815.0 | 0 | 0 | Revolving loans | M | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 135000.0 | 67500.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | MONDAY | 2 | 2 | 0 | 1 | Government | 1 | 1 | 1 | -2531 | -4260.0 | -225 | -19046 | 0.010032 | House / apartment | Single / not married | Secondary / secondary special | Working | 100004 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Laborers | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Unaccompanied | 2.0 | 0.0 | 2.0 | 0.0 | 0.650442 | 297000.0 | 29686.5 | 2.0 | -617.0 | 0 | 0 | Cash loans | F | N | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 312682.5 | 135000.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17 | WEDNESDAY | 2 | 2 | 0 | 1 | Business Entity Type 3 | 0 | 1 | 1 | -2437 | -9833.0 | -3039 | -19005 | 0.008019 | House / apartment | Civil marriage | Secondary / secondary special | Working | 100006 |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Core staff | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Unaccompanied | 0.0 | 0.0 | 0.0 | 0.0 | 0.322738 | 513000.0 | 21865.5 | 1.0 | -1106.0 | 0 | 1 | Cash loans | M | N | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 513000.0 | 121500.0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 11 | THURSDAY | 2 | 2 | 0 | 1 | Religion | 0 | 1 | 1 | -3458 | -4311.0 | -3038 | -19932 | 0.028663 | House / apartment | Single / not married | Secondary / secondary special | Working | 100007 |
pd_target_filter_null.shape
(307511, 122)
import os
# Crear el directorio si no existe
os.makedirs("./data_preprocessing", exist_ok=True)
# Guardar el DataFrame procesado
pd_target_filter_null.to_csv("./data_preprocessing/bank_data_preprocessed.csv", index=False)
Finalmente, se hizo un preprocesamiento inicial de algunas variables. Así pues, se imprimieron los nombres de las columnas y sus tipos de datos para revisar la estructura del DataFrame pd_target_filter_null. Se contaron los valores nulos en cada columna para evaluar la calidad de los datos y se verificó que todas las columnas estuvieran correctamente preparadas para la transformación. Estos pasos iniciales son esenciales para entender el estado actual de los datos, identificar posibles problemas de calidad y asegurar que las transformaciones que se harán en el futuro se realicen sobre datos consistentes y limpios.
Posteriormente, se compararon los valores nulos antes y después de las transformaciones para confirmar que no se introdujeron nuevos valores nulos durante el procesamiento. Este tipo de verificación es crucial para mantener la integridad de los datos y garantizar que las operaciones de preprocesamiento no afecten negativamente la calidad de los datos. Por último, el DataFrame procesado se guardó en un archivo CSV en un directorio específico, asegurando que los datos preparados están listos para ser empleados en futuras etapas del análisis y modelado predictivo. Este proceso de validación y almacenamiento es primordial para alcanzar una gestión eficiente y precisa de los datos.